feat: Add structured output decoding support for vLLM-spyre #903

R3hankhan123 wants to merge 1 commit into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
```python
# llguidance is not supported on s390x due to endianness issues.
if platform.machine() == "s390x":
    backend = self.vllm_config.structured_outputs_config.backend
    if backend == "guidance":
```
Should this return an error to the user instead?
Before this PR, we always silently ignored structured output requests since we didn't support it at all and didn't want to break tool calling integrations that sometimes requested it. But now that we do support structured output, it might be better to return errors where it's misconfigured so that users know to go fix their deployments. Otherwise we'll be in a mixed state where sometimes structured output works, and sometimes it's not applied.
Now raising RuntimeError instead of a warning
Oh, @R3hankhan123, I'm not sure this can be done here in the model runner, though. Any assertion that happens here generally crashes the server. We would need to return an error for this request only, and the easiest way to do that is to validate the request before it even hits the engine, which we can do in SpyrePlatform.validate_request
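A minimal sketch of that up-front validation, assuming a stub in place of vLLM's real sampling params type (`SamplingParamsStub` and the `llguidance_installed` flag are illustrative, not the actual vLLM-Spyre API):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SamplingParamsStub:
    # Illustrative stand-in for vLLM's SamplingParams; only the field
    # relevant to structured output is modeled here.
    guided_decoding: object | None = None


def validate_request(params: SamplingParamsStub,
                     llguidance_installed: bool) -> None:
    # Raising here rejects only the offending request; an assertion
    # deeper in the model runner would crash the whole server instead.
    if params.guided_decoding is not None and not llguidance_installed:
        raise ValueError(
            "structured output requested but llguidance is not installed")
```

Because this runs before the request reaches the engine, the caller gets a per-request error response rather than an engine crash.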
bot:test
joerunde left a comment
A few things we need to clean up before merging:
- We should probably check for the existence of llguidance rather than doing a platform check on s390x, so that we can be forward-compatible if a version of guidance is released that works on Z
- We should validate the structured outputs config up-front when we validate the request in platform.py to avoid setting the guidance backend when llguidance is not installed
- We should validate that the model is not deployed with guidance as the default structured output backend if llguidance is not installed
- We should add tests for these validation cases, as well as one small test that calls a model with a structured output option so we can ensure it actually works
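The first two points above could be sketched like this; the function names are hypothetical, but `importlib.util.find_spec` is the standard way to probe for an installed package without importing it:

```python
import importlib.util


def llguidance_available() -> bool:
    # Capability check instead of a platform check: a future llguidance
    # release that works on s390x would then be picked up automatically.
    return importlib.util.find_spec("llguidance") is not None


def validate_backend(backend: str) -> None:
    # Hypothetical up-front validation: reject the guidance backend when
    # llguidance is not installed, instead of failing later at runtime.
    if backend == "guidance" and not llguidance_available():
        raise RuntimeError(
            "structured outputs backend 'guidance' requires llguidance, "
            "which is not installed")
```

The same check can be reused both for per-request validation and for rejecting a deployment whose default backend is guidance.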
@joerunde llguidance version 1.7.3 contains the fix, and I have added it in the overrides of pyproject. Can you take a look? Thanks
@R3hankhan123 let's update the PR description to remove the bit about llguidance not being supported on s390x. Also, before merging we still need to encode the examples from the description into unit tests. We should probably flex all the different backends, and be sure that a prompt in the batch that doesn't request structured outputs does not accidentally have structured outputs applied. See
```python
# Verify output is valid JSON
try:
    json_obj = json.loads(output_text)
    assert isinstance(json_obj, dict), "Output should be a JSON object"
```
I'm skeptical that the micro models we use for testing will actually always generate an object, but if this passes on Spyre for both fp16 and fp8 then it's probably okay.
```python
# Generate with structured output
output_structured = spyre_model.generate([prompt_structured], [params_structured])[0]
```
These need to be a single request, like

```python
output = spyre_model.generate([prompt_structured, prompt_freeform], [params_structured, params_freeform])
```

The point is to ensure that requests running within the same batch can have different structured output settings.
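The self-contained part of that check — asserting the structured output matches the schema while the freeform output is left unconstrained — might look like this (the `name`/`age` fields come from the schema in the test under review; the stand-in outputs at the bottom are illustrative):

```python
import json


def check_mixed_batch(structured_text: str, freeform_text: str) -> None:
    # The structured request must decode to a JSON object carrying the
    # schema fields; the freeform request simply must not be empty
    # (it is allowed, but not required, to be valid JSON).
    obj = json.loads(structured_text)
    assert isinstance(obj, dict), "structured output should be a JSON object"
    assert "name" in obj and "age" in obj, "schema fields missing"
    assert freeform_text.strip(), "freeform output should not be empty"


# Stand-in outputs, as if both prompts ran in one batch:
check_mixed_batch('{"name": "Ada", "age": 36}', "a free-form answer")
```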
@joerunde using this, the e2e tests are failing with:

```
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] raise self._exception
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return self.worker.execute_model(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_worker.py", line 770, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] output = self.model_runner.execute_model(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 1541, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] self.maybe_setup_new_prefill(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 1478, in maybe_setup_new_prefill
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] assert len(scheduler_output.scheduled_new_reqs) == 1, (
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] AssertionError: Can only schedule one chunked prefill at a time
```
```python
assert "name" in json_obj, "Output should have 'name' field"
assert "age" in json_obj, "Output should have 'age' field"
assert isinstance(json_obj["name"], str), "'name' should be a string"
assert isinstance(json_obj["age"], int), "'age' should be an integer"
```
```
@@ -0,0 +1,355 @@
"""End-to-end tests for structured output decoding.
```
The tests here in general are pretty great!

There's one big problem here though: we're not using cached models, which will really slow down the testing. On Spyre, if we don't use the model cache, it takes ~30s to reload a new model (for the g3.3-micro that we test with). The choices here are either:

1. Consolidate all the cases into a single test, so we load the model once with each backend and send a few batches of prompts to it
2. Add the `structured_output_backend` as a model cache key and use the cached LLM with that backend

I'd prefer (2); it should be pretty quick to get that hooked up in the model cache. In our CI runs we also have the benefit of running all the tests with a cached LLM in a single process instead of forking the tests, which is much faster as well.
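Option (2) essentially amounts to widening the cache key. A minimal sketch, where the names `_llm_cache` and `get_cached_llm` are illustrative rather than the actual vllm-spyre test helpers:

```python
# Module-level cache keyed by everything that determines the LLM instance.
_llm_cache: dict = {}


def get_cached_llm(model: str, structured_output_backend=None):
    # Include the backend in the key so LLMs configured with different
    # structured output backends are cached (and reused) separately.
    key = (model, structured_output_backend)
    if key not in _llm_cache:
        _llm_cache[key] = object()  # stand-in for constructing the LLM
    return _llm_cache[key]
```

Two test cases asking for the same model and backend then share one loaded LLM, while a case with a different backend triggers exactly one extra load.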
@joerunde I have added code for option 2. Can you review it and let me know if it's the correct approach? Thanks
@R3hankhan123 there's a sort-key class that probably needs to be updated as well; it needs to sort these test cases so that cases with identical LLMs are grouped together.
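That grouping could be done with a tuple sort key over the fields that determine the LLM instance; the dict-based test cases and field names here are illustrative, not the suite's actual case representation:

```python
def llm_sort_key(case: dict) -> tuple:
    # Sort so cases sharing the same model + backend run consecutively,
    # letting the cached LLM be reused without a ~30s reload in between.
    return (case["model"], case.get("structured_output_backend") or "")


cases = [
    {"model": "g3.3-micro", "structured_output_backend": "xgrammar"},
    {"model": "g3.3-micro", "structured_output_backend": None},
    {"model": "g3.3-micro", "structured_output_backend": "xgrammar"},
]
cases.sort(key=llm_sort_key)
```

After the sort, the two `xgrammar` cases are adjacent, so the model is loaded at most once per distinct (model, backend) pair.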
Add support for structured decoding/structured output for sendnn-inference. Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
It also looks like there's a test failure with the mixed batch, so something is wrong and we've violated the constraint that we can only prefill a single request at a time.
Description
Add structured decoding support for vLLM-Spyre.
Test Plan
Run the vLLM server and set the structured output backend to xgrammar and outlines.
Test output
With outlines
Checklist

- [ ] Code formatted (`bash format.sh`)
- [ ] Commits include a `Signed-off-by:` line (DCO compliance)